This session is adapted from the Data Carpentry lesson “Data visualization with ggplot2”: https://datacarpentry.org/R-ecology-lesson/04-visualization-ggplot2.html
The examples in this session show how to generate and customize plots using ggplot2 in R, using an ecology dataset containing survey data on animal species diversity and weights within a study site.
We download and load the dataset in .csv format from
GitHub. The saved .csv file on GitHub includes all
preprocessing from the earlier chapters in the Data Carpentry
materials.
Code for the preprocessing steps is also included in the previous script, available on GitHub.
library(tidyverse)
library(here)
library(ggplot2)
library(hexbin)
# create the 'data' directory if it does not exist yet
dir.create(here("data"), showWarnings = FALSE)
# download saved dataset from GitHub
download.file(
url = "https://raw.githubusercontent.com/lmweber/PHDS-data-visualization-2024/main/data/surveys_complete.csv",
destfile = here("data/surveys_complete.csv")
)
# load data
surveys_complete <- read_csv(here("data/surveys_complete.csv"))
## Rows: 30463 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): species_id, sex, genus, species, taxa, plot_type
## dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
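The message above suggests two ways to silence the column specification report. The following is a minimal sketch of both, using a small inline CSV with made-up values (not the survey data) so that it runs without the download. It assumes readr version 2.0 or later, where literal data can be passed with I(); the names csv_text, demo_quiet, and demo_typed are hypothetical.

```r
library(readr)

# small inline CSV standing in for the real file (hypothetical values)
csv_text <- "record_id,species_id,weight\n1,NL,204\n2,DM,40\n"

# option 1: keep automatic type guessing but suppress the message
demo_quiet <- read_csv(I(csv_text), show_col_types = FALSE)

# option 2: state the column types explicitly
demo_typed <- read_csv(
  I(csv_text),
  col_types = cols(
    record_id = col_double(),
    species_id = col_character(),
    weight = col_double()
  )
)

# inspect the resulting column classes
sapply(demo_typed, class)
```

Stating the types explicitly also guards against a column being guessed differently if the file changes later.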
Have a look at the dataset.
head(surveys_complete)
## # A tibble: 6 × 13
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 845 5 6 1978 2 NL M 32 204
## 2 1164 8 5 1978 2 NL M 34 199
## 3 1261 9 4 1978 2 NL M 32 197
## 4 1756 4 29 1979 2 NL M 33 166
## 5 1818 5 30 1979 2 NL M 32 184
## 6 1882 7 4 1979 2 NL M 32 206
## # ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>
Generate an initial scatter plot to visualize the data.
Note the syntax used for the ggplot() function. The
first argument (data) provides the input data frame, and
the second argument (mapping = aes()) specifies the mapping
of data variables to plot aesthetics (in this case the x
and y axes). We then use the + operator to add layers
to the plot, here geom_point(), which draws the data
as points.
# generate an initial scatter plot
ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) +
geom_point()
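Because layers are added with +, a plot can also be built up step by step. The sketch below illustrates this with a small toy data frame (hypothetical values standing in for surveys_complete); the names toy, base_plot, and scatter_plot are made up for the example.

```r
library(ggplot2)

# toy data standing in for surveys_complete (hypothetical values)
toy <- data.frame(
  weight = c(40, 48, 120, 180, 204),
  hindfoot_length = c(35, 37, 30, 33, 32)
)

# data and aesthetic mapping only: this draws empty axes, with no layers yet
base_plot <- ggplot(data = toy, mapping = aes(x = weight, y = hindfoot_length))

# adding a geom with `+` turns it into a scatter plot
scatter_plot <- base_plot + geom_point()

# display the plot
scatter_plot
```

When splitting a plot over several lines, the + must come at the end of a line rather than at the start of the next one, otherwise R treats the first line as a complete expression.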
We can also specify a different geom (geometric object)
to plot the data in a different way.
This demonstrates how easily ggplot allows us to change the type of plot, once we have provided the data and specified the mapping of data variables to plot aesthetics.
# 'geom_hex()'
ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) +
geom_hex()
We can modify the plot by providing additional arguments.
For example, the alpha argument specifies a level of
transparency of points.
# 'alpha' argument
ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.1)
The color argument sets a single color for all points.
# 'color' argument
ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.1, color = "blue")
We can also use the color argument within
aes() to color points by the values of a variable. Note
that we can specify color within aes() either
at the top level (within the ggplot() function) or within
the geom_point() function. This gives flexibility for
setting colors only for certain geoms or for the whole
plot.
# set 'color' using categorical variable
ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.1, aes(color = species_id))
ggplot(surveys_complete, aes(x = weight, y = hindfoot_length,
color = species_id)) +
geom_point(alpha = 0.1)
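With a single geom, the two versions above look the same. The difference shows up once a plot has more than one geom. The following sketch uses toy data (hypothetical values) and adds geom_line() purely to make the inheritance visible; it is an illustration, not part of the original analysis.

```r
library(ggplot2)

# toy data with two species (hypothetical values)
toy <- data.frame(
  weight = c(40, 45, 50, 180, 190, 200),
  hindfoot_length = c(35, 36, 37, 31, 32, 33),
  species_id = c("DM", "DM", "DM", "NL", "NL", "NL")
)

# color mapped at the top level: inherited by both geoms,
# so points and lines are colored (and grouped) by species
plot_global <- ggplot(toy, aes(x = weight, y = hindfoot_length,
                               color = species_id)) +
  geom_point() +
  geom_line()

# color mapped inside geom_point() only: points are colored by species,
# while geom_line() draws a single black line through all points
plot_local <- ggplot(toy, aes(x = weight, y = hindfoot_length)) +
  geom_point(aes(color = species_id)) +
  geom_line()

plot_global
plot_local
```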
Generate boxplots using geom_boxplot().
# boxplots
ggplot(surveys_complete, aes(x = species_id, y = weight)) +
geom_boxplot()
Add points and additional formatting arguments to the boxplots. Note that the order in which the plot elements are added affects the output: later layers are drawn on top of earlier ones.
# boxplots with points and additional formatting
ggplot(surveys_complete, aes(x = species_id, y = weight)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(alpha = 0.3, color = "tomato")
# note order of plot elements
ggplot(surveys_complete, aes(x = species_id, y = weight)) +
geom_jitter(alpha = 0.3, color = "tomato") +
geom_boxplot(outlier.shape = NA)
# additional formatting
ggplot(surveys_complete, aes(x = species_id, y = weight)) +
geom_jitter(alpha = 0.3, color = "tomato") +
geom_boxplot(outlier.shape = NA, fill = NA)
Alternatively, we can use violin plots. (Which of these alternative visualizations seems to be the most effective for this dataset?)
# violin plots
ggplot(surveys_complete, aes(x = species_id, y = weight)) +
geom_violin()
ggplot(surveys_complete, aes(x = species_id, y = weight)) +
geom_violin(fill = "#40E0D0")
To demonstrate line plots for time series data, we first calculate the number of records per year for each genus.
# count the number of records per year for each genus
yearly_counts <-
surveys_complete |>
count(year, genus)
head(yearly_counts)
## # A tibble: 6 × 3
## year genus n
## <dbl> <chr> <int>
## 1 1977 Chaetodipus 3
## 2 1977 Dipodomys 222
## 3 1977 Onychomys 1
## 4 1977 Perognathus 22
## 5 1977 Peromyscus 2
## 6 1977 Reithrodontomys 2
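The count() call above is shorthand for grouping the data and counting the rows in each group. A small sketch of this equivalence, using a toy tibble with hypothetical values (the names toy, counts_short, and counts_long are made up for the example):

```r
library(dplyr)

# toy records: one row per observed animal (hypothetical values)
toy <- tibble(
  year = c(1977, 1977, 1977, 1978, 1978),
  genus = c("Dipodomys", "Dipodomys", "Onychomys", "Dipodomys", "Onychomys")
)

# short form, as used above
counts_short <- toy |>
  count(year, genus)

# long form: group the rows, then count the rows in each group
counts_long <- toy |>
  group_by(year, genus) |>
  summarize(n = n(), .groups = "drop")

counts_short
```

Both versions return one row per combination of year and genus, with the number of rows stored in the column n.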
To generate line plots, we can specify group in
aes() to draw one line per genus. However, this alone does
not allow us to tell which line belongs to which genus.
# line plot
ggplot(yearly_counts, aes(x = year, y = n, group = genus)) +
geom_line()
We can use color to identify the lines. Note that
setting color in aes() also automatically
groups the data.
# color by 'genus' variable
ggplot(yearly_counts, aes(x = year, y = n, color = genus)) +
geom_line()
In ggplot, facetting refers to splitting a plot into
multiple panels according to the values of some variable. The
facetting variable is passed to the facet function wrapped in vars().
Here, we use facet_wrap() to facet the line plots by
genus.
# facet by 'genus'
ggplot(yearly_counts, aes(x = year, y = n)) +
geom_line() +
facet_wrap(vars(genus))
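By default, all panels produced by facet_wrap() share the same axis scales, so genera with few records can appear as nearly flat lines. One option is the scales argument of facet_wrap(). The sketch below uses toy counts (hypothetical values) and lets the y axis vary by panel; whether this is appropriate depends on whether the panels are meant to be compared in absolute terms.

```r
library(ggplot2)

# toy yearly counts for one common and one rare genus (hypothetical values)
toy_counts <- data.frame(
  year = rep(1977:1980, times = 2),
  genus = rep(c("Dipodomys", "Onychomys"), each = 4),
  n = c(222, 310, 280, 350, 1, 4, 6, 3)
)

# each panel gets its own y axis range; the x axis stays shared
plot_free <- ggplot(toy_counts, aes(x = year, y = n)) +
  geom_line() +
  facet_wrap(vars(genus), scales = "free_y")

plot_free
```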
We can also include color. However, in this case we have some redundancy between the facetting and the colors. Depending on the dataset, this type of redundancy may either help the reader or overcomplicate things.
# facet by 'genus' and include color
ggplot(yearly_counts, aes(x = year, y = n, color = genus)) +
geom_line() +
facet_wrap(vars(genus))
Alternatively, we can use facetting and further split the lines by
sex. To do this, we need to do some further data
manipulation.
# split counts by sex
yearly_sex_counts <-
surveys_complete |>
count(year, genus, sex)
Now we can create a facetted plot with lines split and colored by the
sex variable.
# facetted plot
ggplot(yearly_sex_counts, aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus))
We can also facet by multiple variables using
facet_grid(). In this case, the plot is starting to get
quite busy and becoming more difficult to read.
# facet by multiple variables using 'facet_grid()'
ggplot(yearly_sex_counts, aes(x = year, y = n, color = sex)) +
geom_line() +
facet_grid(rows = vars(sex), cols = vars(genus))
Facets may be organized by either rows or columns.
# facet by rows with one column
ggplot(yearly_sex_counts, aes(x = year, y = n, color = sex)) +
geom_line() +
facet_grid(rows = vars(genus))
# facet by columns with one row
ggplot(yearly_sex_counts, aes(x = year, y = n, color = sex)) +
geom_line() +
facet_grid(cols = vars(genus))
Themes provide a convenient way to format plots in ggplot2.
For example, theme_bw() (for “black and white”) adjusts
the formatting to show white backgrounds, black borders, gray gridlines,
and additional default settings. There are also several other themes
available.
# demonstrate themes
ggplot(yearly_sex_counts, aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
theme_bw()
ggplot(yearly_sex_counts, aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
theme_classic()
ggplot(yearly_sex_counts, aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
theme_minimal()
Additional functions are available to further customize the plot. For example, we can add more informative plot titles and axis titles.
# specify plot title and axis titles
ggplot(yearly_sex_counts, aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of individuals") +
theme_bw()
The theme() function can be used to add further
customizations, such as font size. Note the special syntax using
element_text().
# using 'theme()' to adjust font size
ggplot(yearly_sex_counts, aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of individuals") +
theme_bw() +
theme(text = element_text(size = 16))
Numerous additional options are available for detailed customizations in ggplot.
# demonstrate additional options for 'theme()'
ggplot(yearly_sex_counts, aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of individuals") +
theme_bw() +
theme(axis.text.x = element_text(color = "gray20", size = 12, angle = 90,
hjust = 0.5, vjust = 0.5),
axis.text.y = element_text(color = "gray20", size = 12),
strip.text = element_text(face = "italic"),
text = element_text(size = 16))
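When the same theme settings are needed for several plots, one option is to store them in a variable and add that variable to each plot. The following is a minimal sketch with toy data (hypothetical values); the name my_theme is made up for the example.

```r
library(ggplot2)

# store a complete theme plus custom adjustments in a variable
my_theme <- theme_bw() +
  theme(axis.text.x = element_text(color = "gray20", size = 12, angle = 90,
                                   hjust = 0.5, vjust = 0.5),
        axis.text.y = element_text(color = "gray20", size = 12),
        text = element_text(size = 16))

# toy yearly counts (hypothetical values)
toy_counts <- data.frame(
  year = 1977:1980,
  n = c(222, 310, 280, 350)
)

# reuse the stored theme by adding it like any other plot element
plot_themed <- ggplot(toy_counts, aes(x = year, y = n)) +
  geom_line() +
  my_theme

plot_themed
```

This keeps the formatting consistent across plots and means a change to the theme only has to be made in one place.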
Here, we demonstrate the use of the patchwork package to
arrange multiple plots in panels.
First, we load the patchwork package.
library(patchwork)
The patchwork package uses a specific syntax
(| for columns, / for rows).
We start by creating some plots and storing them by assigning them to variables.
# create plots and assign to variables
plot_weight <- ggplot(surveys_complete, aes(x = species_id, y = weight)) +
geom_boxplot() +
labs(x = "Species", y = expression(log[10](Weight))) +
scale_y_log10()
plot_count <- ggplot(yearly_counts, aes(x = year, y = n, color = genus)) +
geom_line() +
labs(x = "Year",
y = "Abundance")
Now, we can display the plots in panels using the
patchwork syntax.
# display plots in rows
plot_weight / plot_count
# modify row heights
plot_weight / plot_count +
plot_layout(heights = c(3, 2))
# display plots in columns
plot_weight | plot_count
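The | and / operators can also be combined to build nested layouts. The sketch below uses three simple toy plots (hypothetical data; the names p1, p2, p3, and combined are made up for the example). Parentheses are needed because in R the / operator binds more tightly than |.

```r
library(ggplot2)
library(patchwork)

# toy data and three simple plots (hypothetical values)
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))
p1 <- ggplot(toy, aes(x = x, y = y)) + geom_point()
p2 <- ggplot(toy, aes(x = x, y = y)) + geom_line()
p3 <- ggplot(toy, aes(x = x, y = y)) + geom_col()

# two plots side by side in the top row, one wide plot in the bottom row;
# the bottom row is given twice the height of the top row
combined <- (p1 | p2) / p3 +
  plot_layout(heights = c(1, 2))

combined
```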
Finally, we can export or save plots using the ggsave()
function. Usually we will save plots in either .png or
.pdf format. The ggsave() function detects the
format automatically from the file name.
# generate plot
my_plot <- ggplot(yearly_sex_counts, aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of individuals") +
theme_bw() +
theme(axis.text.x = element_text(color = "gray20", size = 12, angle = 90,
hjust = 0.5, vjust = 0.5),
axis.text.y = element_text(color = "gray20", size = 12),
text = element_text(size = 16))
# display plot
my_plot
Save plot in .png format.
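# create the 'plots' directory if it does not exist yet
dir.create(here("plots"), showWarnings = FALSE)
# save plot in .png format
ggsave(here("plots/my_plot.png"), my_plot, width = 15, height = 10)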
Save plot in .pdf format.
# save plot in .pdf format
ggsave(here("plots/my_plot.pdf"), my_plot, width = 15, height = 10)
We can also specify the resolution for .png format.
Here, we save a combined plot using patchwork and
ggsave() in .png format with a customized
resolution.
# save combined plot with custom resolution in .png format
plot_combined <-
plot_weight / plot_count +
plot_layout(heights = c(3, 2))
ggsave(here("plots/plot_combined.png"), plot_combined, width = 10, dpi = 300)
## Saving 10 x 5 in image
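The message above appears because height was not specified, so ggsave() fell back to the size of the current graphics device for that dimension. To get the same output regardless of the device, it may be safer to give both width and height. The following sketch uses a toy plot (hypothetical values) and writes to a temporary file rather than the plots directory; the names p and out_file are made up for the example.

```r
library(ggplot2)

# toy plot (hypothetical values)
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_line()

# write to a temporary file so the example does not touch the project folder
out_file <- tempfile(fileext = ".png")

# width and height are in inches by default; at dpi = 300 this requests
# an image of 10 * 300 = 3000 by 6 * 300 = 1800 pixels
ggsave(out_file, plot = p, width = 10, height = 6, dpi = 300)

# confirm that the file was written
file.exists(out_file)
```

The dpi argument only affects raster formats such as .png; for vector formats such as .pdf, only width and height determine the output size.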